Analysis of the download stats of the Experimental packages in the Bioconductor project.
Here we are going to analyse the Experimental packages of Bioconductor. See the home of the analysis here.
First we read the latest data from the Bioconductor project. There are two files: one with the download stats from 2009 until today and another with the download stats of the software packages. We will only use the first one:
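library("data.table") # provides the DT[i, j, by] syntax used below
library("ggplot2")
load("stats.RData") # loads the data.table `stats`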
stats <- stats[Category == "Experimental", ]
stats
## Package Year Month Nb_of_distinct_IPs Nb_of_downloads
## 1: SNAData 2017 02 1 1
## 2: SNAData 2017 04 2 4
## 3: SNAData 2017 05 2 3
## 4: SNAData 2016 02 1 1
## 5: SNAData 2016 03 6 6
## 6: SNAData 2016 04 4 5
## 7: SNAData 2016 05 9 9
## 8: SNAData 2016 09 1 1
## 9: SNAData 2016 11 2 3
## 10: SNAData 2015 09 3 6
## ---
## 17793: biotmleData 2017 04 24 42
## 17794: biotmleData 2017 05 47 55
## 17795: biotmleData 2017 06 32 38
## 17796: microRNAome 2017 06 1 1
## 17797: sampleClassifierData 2017 01 5 5
## 17798: sampleClassifierData 2017 02 5 5
## 17799: sampleClassifierData 2017 03 6 9
## 17800: sampleClassifierData 2017 04 8 31
## 17801: sampleClassifierData 2017 05 12 15
## 17802: sampleClassifierData 2017 06 12 16
## Category Date
## 1: Experimental 2017-02-01 01:00:00
## 2: Experimental 2017-04-01 02:00:00
## 3: Experimental 2017-05-01 02:00:00
## 4: Experimental 2016-02-01 01:00:00
## 5: Experimental 2016-03-01 01:00:00
## 6: Experimental 2016-04-01 02:00:00
## 7: Experimental 2016-05-01 02:00:00
## 8: Experimental 2016-09-01 02:00:00
## 9: Experimental 2016-11-01 01:00:00
## 10: Experimental 2015-09-01 02:00:00
## ---
## 17793: Experimental 2017-04-01 02:00:00
## 17794: Experimental 2017-05-01 02:00:00
## 17795: Experimental 2017-06-01 02:00:00
## 17796: Experimental 2017-06-01 02:00:00
## 17797: Experimental 2017-01-01 01:00:00
## 17798: Experimental 2017-02-01 01:00:00
## 17799: Experimental 2017-03-01 01:00:00
## 17800: Experimental 2017-04-01 02:00:00
## 17801: Experimental 2017-05-01 02:00:00
## 17802: Experimental 2017-06-01 02:00:00
There have been 340 Experimental packages in Bioconductor. Some were added early and others more recently.
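As a minimal sketch of how such a count can be obtained, the example below uses a small made-up table with the same column names as `stats`; the package names and values are hypothetical.

```r
library(data.table)

# Toy data in the same shape as `stats` (values are made up for illustration)
toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC", "pkgC"),
  Date = as.POSIXct(c("2016-01-01", "2016-02-01", "2016-02-01",
                      "2017-01-01", "2017-02-01", "2017-03-01"), tz = "UTC"),
  Nb_of_downloads = c(3, 5, 1, 10, 12, 8)
)

# Number of distinct packages that appear in the download stats
n_packages <- uniqueN(toy$Package)

# Number of months with at least one download, per package
months_seen <- toy[, .(Months = .N), by = Package]
```

Applied to the real data, `uniqueN(stats$Package)` would give the number of packages with at least one recorded download.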
First we explore the number of packages being downloaded by month:
theme_bw <- theme_bw(base_size = 16)
scal <- scale_x_datetime(date_breaks = "3 months")
ggplot(stats[, .(Downloads = .N), by = Date], aes(Date, Downloads)) +
geom_bar(stat = "identity") +
theme_bw +
ggtitle("Packages downloaded") +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scal +
xlab("")
Figure 1: Packages in Bioconductor with downloads
The number of packages being downloaded increases almost exponentially with time. This is partially explained by the incorporation of new packages.
ggplot(stats[, .(Number = sum(Nb_of_downloads)), by = Date], aes(Date, Number)) +
geom_bar(stat = "identity") +
theme_bw +
ggtitle("Downloads") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
xlab("")
Figure 2: Downloads of packages
Even if the number of packages increases exponentially, the number of downloads from 2011 grows linearly with time. This indicates that each package must compete with more and more packages to be downloaded.
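One way to check a "linear versus exponential" impression is to fit both shapes and compare the residuals on the original scale. The sketch below uses a hypothetical series (`downloads`, `t`) that is linear by construction; it is not the Bioconductor data.

```r
# Hypothetical monthly totals: a series that grows by a fixed amount each month
t <- 1:36
downloads <- 1000 + 250 * t

# Linear model on the raw scale, and on the log scale (exponential growth)
fit_linear <- lm(downloads ~ t)
fit_exp <- lm(log(downloads) ~ t)

# Compare both fits on the original scale using the residual sum of squares
rss_linear <- sum(residuals(fit_linear)^2)
rss_exp <- sum((downloads - exp(fitted(fit_exp)))^2)
c(linear = rss_linear, exponential = rss_exp)
```

With the real data, `t` could be a month index and `downloads` the monthly sum of `Nb_of_downloads`; the smaller residual sum points to the better description.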
pd <- position_dodge(0.1)
ggplot(stats[, .(Number = mean(Nb_of_downloads),
ymin = mean(Nb_of_downloads)-1.96*sd(Nb_of_downloads)/sqrt(.N),
ymax = mean(Nb_of_downloads)+1.96*sd(Nb_of_downloads)/sqrt(.N)),
by = Date], aes(Date, Number)) +
geom_errorbar(aes(ymin = ymin, ymax = ymax), width=.1, position=pd) +
geom_point() +
geom_line() +
theme_bw +
ggtitle("Downloads") +
ylab("Mean download for a package") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
xlab("")
Figure 3: Downloads of packages per package
The error bar indicates the 95% confidence interval.
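For reference, the error bars computed in the code above follow the usual normal-approximation interval for a mean. Writing $\bar{x}_d$ for the mean of `Nb_of_downloads` over the $n_d$ packages downloaded at date $d$, and $s_d$ for their standard deviation (symbols introduced here for illustration), the interval is:

```latex
\[
\bar{x}_d \pm 1.96 \, \frac{s_d}{\sqrt{n_d}}
\]
```

This assumes that per-package downloads are roughly independent and that $n_d$ is large enough for the normal approximation; with heavily skewed download counts the interval is only indicative.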
Here we can appreciate that the number of downloads per package hasn’t changed much with time. If anything, there is now more dispersion between the downloads of different packages.
This might be due to an increase in the usage of packages or to new packages bringing more users. We start by finding out how many packages have been introduced in Bioconductor each month.
today <- Sys.Date()
year <- format(today, "%Y") # four-digit year, e.g. "2017"
month <- format(today, "%m") # zero-padded month, e.g. "06", as in the Month column
incorporation <- stats[ , .SD[which.min(Date)], by = Package, .SDcols = "Date"]
histincorporation <- incorporation[, .(Number = .N), by = Date, ]
ggplot(histincorporation, aes(Date, Number)) +
geom_bar(stat="identity") +
theme_bw +
ggtitle("Packages with first download") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
xlab("")
Figure 4: New packages
We can see that there were more than 1500 packages in Bioconductor before 2009, and since then there is occasionally a rise to 500 new downloads (which would be new packages being added).
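The first-download lookup can also be written with `min()` per package. The following is a minimal sketch on made-up data (the package names and dates are hypothetical), not a replacement for the code above.

```r
library(data.table)

toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  Date = as.POSIXct(c("2015-03-01", "2015-04-01", "2016-01-01",
                      "2016-05-01", "2016-06-01"), tz = "UTC")
)

# First month in which each package was downloaded
first_seen <- toy[, .(Date = min(Date)), by = Package]

# Number of packages first seen in each month
new_per_month <- first_seen[, .(Number = .N), by = Date]
```

Replacing `min()` with `max()` gives the last month in which each package was seen.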
Using a similar procedure we can approximate the packages deprecated and removed each month. In this case we look for the last date a package was downloaded, excluding the current month:
deprecation <- stats[, .SD[which.max(Date)], by = Package, .SDcols = c("Date", "Year", "Month")]
deprecation <- deprecation[!(Month == month & Year == year), ] # Drop packages last seen in the current month
histDeprecation <- deprecation[, .(Number = .N), by = Date, ]
ggplot(histDeprecation, aes(Date, Number)) +
geom_bar(stat = "identity") +
theme_bw +
ggtitle("Packages without downloads") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ylab("Last seen packages") +
xlab("")
Figure 5: Date where a package was last downloaded
Approximates the date when packages were removed from Bioconductor.
Here we can see the packages whose last download was in a certain month, assuming that this means they are deprecated. It can happen that a package is no longer downloaded but is still in the Bioconductor repository; this would be the reason for the spike to 3000 packages in the last month. In total there are 326 packages downloaded. We further explore how much time passes between the incorporation of a package and its last download.
df <- merge(incorporation, deprecation, by = "Package")
timeBioconductor <- as.numeric(difftime(df$Date.y, df$Date.x, units = "days")) / 365.25 # Transform to years
hist(timeBioconductor, main = "Time in Bioconductor", xlab = "Years")
abline(v = mean(timeBioconductor), col = "red")
abline(v = median(timeBioconductor), col = "green")
Figure: Time of packages between first and last download
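Subtracting two date-times in R gives a `difftime` whose unit is chosen automatically, so it is safer to request the unit explicitly before converting. A minimal sketch with made-up dates:

```r
# Hypothetical first and last download dates for two packages
first <- as.POSIXct(c("2015-01-01", "2016-06-01"), tz = "UTC")
last <- as.POSIXct(c("2017-01-01", "2016-12-01"), tz = "UTC")

# Ask for days explicitly instead of relying on the automatic unit choice
days <- as.numeric(difftime(last, first, units = "days"))

# Approximate conversion to years (365.25 days per year is a convention)
years <- days / 365.25
```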
The mean time of a package in Bioconductor is…. Not surprisingly, 0 packages incorporated before 2009 are still in the repository. But how do the packages that were not removed fare in Bioconductor?
We can start by comparing the number of downloads (different from 0) with the number of IPs that download each package:
ggplot(stats, aes(Nb_of_distinct_IPs, Nb_of_downloads, col = Package)) +
geom_point() +
theme_bw +
geom_smooth(method = "lm") +
xlab("Number of distinct IPs") +
ylab("Number of downloads") +
ggtitle("Downloads by different IP") +
geom_abline(slope = 2) +
guides(col = FALSE)
Figure 6: Downloads and distinct IPs of all months and packages
Each color is a package; the black line represents 2 downloads per IP.
Not surprisingly, most packages have two downloads from the same IP, one for each Bioconductor release (black line). However, for some packages a few IPs download the same package many times, which may indicate that these packages are mostly installed in a few locations.
ratio <- stats[, .(slope = coef(lm(Nb_of_downloads~Nb_of_distinct_IPs))[2]), by = Package]
ratio <- ratio[order(slope, decreasing = TRUE), ]
ratio <- ratio[!is.na(slope), ]
ratio$Package <- as.character(ratio$Package)
ratio
## Package slope
## 1: breastCancerNKI 4.8135334
## 2: DLBCL 4.2674435
## 3: Neve2006 4.1159367
## 4: bronchialIL13 3.6021538
## 5: Illumina450ProbeVariants.db 3.2879595
## 6: parathyroid 3.2434080
## 7: facsDorit 3.2349722
## 8: DeSousa2013 3.1993847
## 9: geuvPack 3.0943912
## 10: beadarrayExampleData 2.9841274
## ---
## 321: prostateCancerGrasso 1.0261959
## 322: Affymoe430Expr 1.0217286
## 323: ESNSTCC 1.0000000
## 324: prostateCancerTaylor 0.9683099
## 325: Single.mTEC.Transcriptomes 0.9637681
## 326: RITANdata 0.8461538
## 327: biotmleData 0.6381418
## 328: SVM2CRMdata 0.5399736
## 329: M3DExampleData 0.4134078
## 330: MIGSAdata -0.3488372
We can see that the package with the most downloads from the same IP is breastCancerNKI, followed by DLBCL, Neve2006 and bronchialIL13. At the moment I last edited this manually, the first one is for ChIP-seq, the second one for flow cytometry, and the third and fourth are for chromatographically separated and single-spectra mass spectral data; maybe only a few locations use these packages.
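To make the slope easier to read, here is a small sketch with hypothetical numbers: a package that every IP downloads exactly twice has a slope of 2, while extra downloads from a single heavy user in the busiest months push the slope above 2.

```r
# Hypothetical package: every IP downloads it exactly twice in each month
ips <- c(1, 3, 5, 8, 12)
downloads <- 2 * ips
slope <- coef(lm(downloads ~ ips))[["ips"]]

# Same package, but one server adds 20 extra downloads in the two busiest months
downloads_server <- downloads + c(0, 0, 0, 20, 20)
slope_server <- coef(lm(downloads_server ~ ips))[["ips"]]

c(regular = slope, with_server = slope_server)
```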
I am curious how the default packages of Bioconductor are downloaded; let’s see where they are:
ratio[Package %in% bioc_packages, ]
## Empty data.table (0 rows) of 2 cols: Package,slope
The result is empty: none of the default Bioconductor packages is an Experimental package.
Now we explore whether there are seasonal cycles in the downloads, as there seem to be some cycles in figure ??.
First we can explore the number of IPs per month downloading each package:
ggplot(stats, aes(Date, Nb_of_distinct_IPs, col = Package)) +
geom_line() +
theme_bw +
ggtitle("IPs") +
ylab("Distinct IP downloads") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
guides(col = FALSE)
Figure 7: Distinct IP per package
As we can see, there are two groups of packages in 2009: some with a low number of IPs and some with a larger number of IPs. As time progresses, the number of distinct IPs increases for some packages. But is the spread in IPs associated with an increase in downloads?
ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Downloads per package") +
ylab("Downloads") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
guides(col = FALSE)
Figure 8: Downloads per year
Surprisingly, some packages have a big burst of up to 400k downloads, others of just 100k downloads. But let's focus on the lower end:
ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Downloads per package every three months") +
ylab("Downloads") +
scal +
ylim(0, 50000)+
theme(axis.text.x=element_text(angle=60, hjust=1)) +
guides(col = FALSE)
Figure 9: Downloads per year
There are many packages close to 0 downloads each month, but most packages have fewer than 10000 downloads per month:
ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) +
geom_line() +
theme_bw+
ggtitle("Downloads per package every three months") +
ylab("Downloads") +
scal +
ylim(0, 10000)+
theme(axis.text.x=element_text(angle=60, hjust=1)) +
guides(col = FALSE)
Figure 10: Downloads per year
As we can see, in general the month of the year also influences the number of downloads. So from 2010 onwards, the factors influencing the downloads are the year and the month.
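A simple way to look at the month effect is to average the downloads per calendar month across years. The sketch below uses a made-up series with a dip every August; the column names follow `stats`, but the numbers are hypothetical.

```r
# Toy monthly series over two years with a dip every August (made-up numbers)
toy <- data.frame(
  Year = rep(c("2015", "2016"), each = 12),
  Month = rep(sprintf("%02d", 1:12), times = 2),
  Nb_of_downloads = rep(c(100, 110, 120, 115, 110, 100,
                          90, 60, 105, 120, 115, 95), times = 2)
)

# Mean downloads per calendar month, averaged over years
by_month <- tapply(toy$Nb_of_downloads, toy$Month, mean)

# Calendar month with the fewest downloads on average
lowest_month <- names(which.min(by_month))
```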
Maybe there is a relationship between the downloads and the number of IPs per date:
ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) +
geom_line() +
theme_bw +
ggtitle("IPs") +
ylab("Ratio") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
guides(col = FALSE)
Figure 11: Ratio downloads per IP per package
We can see that some packages have occasional rises in downloads per IP. But in the low range we miss a lot of packages:
ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) +
geom_line() +
theme_bw +
ggtitle("IPs") +
ylab("Ratio") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
guides(col = FALSE) +
ylim(1, 5)
Figure 12: Ratio downloads per IP per package
But most of the packages seem to be more or less constant and around 2.
One problem when comparing the evolution of the packages is that they started at different moments and, as seen, both the number of downloads and the number of packages have been increasing with time. So we need to normalize the starting dates:
norm <- stats[, .(Norm = as.numeric(Date)/as.numeric(max(Date)),
Downloads = Nb_of_downloads/max(Nb_of_downloads)), by = Package]
ggplot(norm, aes(Norm, Downloads, col = Package)) +
geom_line() +
theme_bw() +
ggtitle("Downloads per stage of the package") +
xlab("Date normalized") +
guides(col = FALSE)
Figure 13: Normalization of dates and downloads
We can observe a tendency for the number of downloads to decrease after a package is included in Bioconductor and to rise again later.
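The normalization above divides each date by the latest date of the package. An alternative, shown here only as a sketch on made-up data, is min-max scaling, which maps the first download of every package to 0 and the last to 1. The helper `rescale01` is a hypothetical name introduced for this example.

```r
library(data.table)

toy <- data.table(
  Package = rep(c("pkgA", "pkgB"), each = 3),
  Date = as.POSIXct(c("2010-01-01", "2010-07-01", "2011-01-01",
                      "2016-01-01", "2016-02-01", "2016-03-01"), tz = "UTC"),
  Nb_of_downloads = c(10, 20, 40, 5, 50, 25)
)

# Rescale a numeric vector to [0, 1]; constant vectors map to 0 (a convention chosen here)
rescale01 <- function(x) {
  rng <- range(x)
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

norm_toy <- toy[, .(Norm = rescale01(as.numeric(Date)),
                    Downloads = Nb_of_downloads / max(Nb_of_downloads)),
                by = Package]
```

With this scaling, packages that entered Bioconductor years apart share the same horizontal axis, so their trajectories can be overlaid directly.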
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.10.4 ggplot2_2.2.1 BiocStyle_2.4.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10 knitr_1.15.1 magrittr_1.5 munsell_0.4.3
## [5] colorspace_1.3-2 stringr_1.2.0 highr_0.6 plyr_1.8.4
## [9] tools_3.4.0 grid_3.4.0 gtable_0.2.0 htmltools_0.3.6
## [13] yaml_2.1.14 lazyeval_0.2.0 rprojroot_1.2 digest_0.6.12
## [17] tibble_1.3.0 bookdown_0.3 evaluate_0.10 rmarkdown_1.5
## [21] labeling_0.3 stringi_1.1.5 compiler_3.4.0 scales_0.4.1
## [25] backports_1.0.5